Turn-taking model

ABSTRACT

A method is claimed for managing interactive dialog between a machine and a user. In one embodiment, an interaction between the machine and the user is managed in response to a timing position of possible speech onset from the user. In another embodiment, the interaction between the machine and the user is dependent upon the timing of a recognition result, which is relative to a cessation of a verbalization of a desired sequence from the machine. In another embodiment, the interaction between the machine and the user is dependent upon a recognition result and whether the desired sequence was ceased or not ceased.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority from U.S. Provisional Patent Application entitled “TUI DESIGN TURN TAKING” by Attwater et al., filed Dec. 22, 2004, Ser. No. 60/638,431, which is hereby incorporated by reference.

FIELD OF THE INVENTION

The present invention relates to a turn-taking state machine, and more particularly, to a turn-taking model for an interactive system that handles an interaction between a person and a machine which uses speech recognition.

DESCRIPTION OF THE RELATED ART

The term ‘Turn-taking’ refers to the pattern of interaction which occurs when two or more people communicate using spoken language. At any given moment one or all of the people in a conversation may be speaking, thinking of speaking or remaining silent. Turn-taking is the protocol by which the participants in the conversation decide whether, and when, it is their turn to speak.

The normal pattern of turn-taking is for one person to speak at a time. There are however many instances where speakers overlap their speech. Turns evolve over time and have internal structure. A ‘turn’ may actually comprise a number of “turns-so-far”, termed Turn Constructional Units (TCU). TCUs will often be capable of forming turns in their own right, i.e. a turn is made up of smaller turns. For clarity, in this description we shall adopt the term ‘Move’ for a TCU, where turns are made up of one or more moves. Boundaries between moves form potential turn-taking boundaries.

At some move boundaries (e.g. at the end of a question), the speaker will elect for another speaker to take over at that point. Even when speakers elect to give the turn away, they may choose to reclaim it if the listener chooses not to respond. Listeners may of course self-select to take turns at other points.

Turn-taking can also exist in a conversation between a machine and a person or ‘user’. Just as in Human-Human conversation, Human-Machine conversation must deal with the phenomenon of interruption. When one of the conversants (say the machine) is speaking, the other conversant (the user) may choose to remain silent or interrupt at any moment. The response of the machine to such an interruption will define its turn-taking policy. On interruption, for example, it may choose to ‘hold’ the ‘floor’ or ‘yield’ it.

The use of the term ‘floor’ is by analogy to formal debate where speakers are given the ‘floor’ in order to express their views one at a time. To ‘hold’ the floor is to continue speaking whilst being interrupted. To ‘yield’ the floor is either to stop speaking on an interruption (a self-selected turn change) or to stop speaking to give someone the chance to speak next (an elective turn change).

Yielding the Floor

The most common turn-taking patterns are orderly transitions at turn-taking boundaries. These comprise three primary patterns. FIG. 1A shows an orderly transition of a turn where speaker A pauses, speaker B speaks in the pause, and speaker A yields the floor (i.e. lets speaker B continue). Such patterns generally occur at elective turn-taking boundaries, i.e. where the speaker intends the other person to speak. FIG. 1B shows a similarly ordered transition but where speaker B has anticipated the turn-transition point at the end of A1. Speaker A has yielded in a similar manner. This pattern may occur in elective transitions or self-selected ones. That is to say, speaker A may have had something planned to say following turn A1 but chose not to continue with the planned utterance. Finally, FIG. 1C shows the case where speaker B started to respond to utterance A1, but it was a late response and speaker A has begun the next move. Speaker A interprets this interruption as a response to turn A1 and immediately backs off to let speaker B continue. It should be noted that overlapping speech such as that shown in the latter two examples is more common in telephone conversations than in face-to-face conversations.

Holding the Floor

The other less frequent, but still significant, pattern seen in human-human conversation is floor-holding. Here the speaker chooses to hold the conversational floor in the presence of an interruption from the other speaker. There can be many different reasons for doing so, which will depend amongst other things on the topic, the nature of the task, and the relative social relationships between the speakers.

FIGS. 1D and 1E show two examples of floor-holding by speaker A in the presence of an interruption at or around the boundary of two moves. There are several other patterns, depending on the point of interruption and duration of the interruption. The common feature of these patterns is that speaker B backs off and allows speaker A to continue. In such circumstances, speaker B will generally hold the planned utterance B1 and produce it later if it is still relevant given the context. Floor holding often causes the thwarted speaker to re-produce the aborted utterance at the next turn-taking opportunity. FIG. 1F shows the situation where speaker B has interrupted speaker A in the middle of a move. Speaker A has ignored the interruption and continued speaking. Such interruptions are not common in human-human task oriented conversation, but do occur. Examples include speaking along with the giver (for example during a number confirmation), side comments to other participants, and strongly competitive interruptions. It is of interest that in automated dialogues such occurrences are much more common, due to the unusual social contract between man and system and the deliberate encouragement of the use of barge-in as a user-interface device.

Mutual Back-Off and Re-Starts

On occasions in human-human conversation, when the conversants clash, they both choose to back off. FIG. 1G shows an example of this. When both conversants have backed off, an ambiguous state has occurred in the natural turn-taking protocol, and oscillations can occur with repeated back-offs. It often becomes necessary to resort to meta-dialogue (“you first!”), which of course can also clash. Such clashes are quite common in current man-machine dialogues employing “barge-in”, for reasons which will be discussed later. “Barge-in” refers to one conversant explicitly speaking while the other conversant has the floor for the purpose of creating an interruption.

Current day automated systems that deal with Turn-Taking between a user and a machine use either Half-Duplex or Full-Duplex mechanisms. Patterns which are seen in half-duplex systems are:

The prompt is never stopped. Speakers are ignored whilst the prompt is playing. If they are still speaking at the end of the prompt, spoke-too-soon conditions are thrown by the recognizer. In this situation tones can be used to denote that users must repeat their utterances. This could be termed an “Always-Hold” protocol.

Patterns which are seen in current full-duplex systems are:

The prompt is stopped when speech input is detected, sometimes after a short delay. Echo cancellation is used to clean up any overlapping of the speech. Recognition of the signal is performed, which returns when a confident result is detected, usually relying on a short period of silence to determine this juncture. This result is returned to the application, which decides what to do next. The response is assumed to relate to the prompt which was stopped. Uncertain recognition will usually result in a repeat of the previous prompt or something with similar meaning, along with optional error messages. Systems generally vary in how quickly they cut the prompt after speech is detected. This can be due to:

a. Autonomic prompt cut on speech detect (there may be a slight inherent delay).

b. Deliberate checking of the initial signal to check whether it looks like valid speech.

c. Recognition of the whole utterance up to band-silence.

These options could be labeled as the following strategies.

(a) “Always-Yield”

(b) “Yield on Speech”

(c) “Yield when confident”

Current speech user-interface designs generally use ‘barge-in’ for one of two purposes, although they are rarely distinguished in the literature. These are:

1) Barge-in as a user-interface device: The user understands that they can interrupt machine prompts at any time. They generally know the available keywords and consciously choose to interrupt the machine as an explicit act.

2) Barge-in to manage turn-taking overlaps: The user interrupts the end of a machine prompt as a natural overlapping turn-taking behavior due to anticipation of the turn-taking juncture. The behavior is generally autonomic although it can be modified with conscious effort.

The confusion between the two is generally compounded by the common practice of recording multiple turns, or phrases with internal turn-taking junctures, in a single prompt.

The problem with the use of barge-in for the second purpose is that the technology displays universal behavior regardless of where in the prompt the interruption occurs. This often leads to serious errors which are amplified by the user interface. A prompt may be cut off by an extraneous noise almost immediately after it begins to play. This will then return a rejected result from the speech recognizer. The application designer then interprets this as a user behavior error, and enters error correction designed to correct an error in response to a prompt which the user has not yet heard. The result is generally unstable user-interface performance, particularly in the presence of noise.

The other problem often observed with current barge-in technology results from delays between the detection of an interruption and the cutoff of the prompt. As described above, this can be due to inherent technology limitations, or to deliberate design in an attempt to avoid the false cut-off problem described above. The result however is that the interrupting user perceives that the machine is ‘holding the floor’, and therefore backs off their own speech just as the machine shuts off its own prompt. Then machine and user are in a race for who will speak first, and turn-clashing can occur cyclically and unpredictably.

The final problem seen in the current state of the art is interruptions at the start of prompts which are delayed responses to the previous phrase. In general this does not result in an obvious error if the same grammar and dialogue state persist between the two phrases. However, in designs which make default transitions of dialogue state between phrases this can result in dialogue state errors.

SUMMARY OF THE INVENTION

A method is disclosed for managing interactive dialog between a machine and a user. In one embodiment, an interaction between the machine and the user is managed in response to a timing position of possible speech onset from the user. In another embodiment, the interaction between the machine and the user is dependent upon the timing of a recognition result, which is relative to a cessation of a verbalization of a desired sequence from the machine. Further, the interaction between the machine and the user is dependent upon a recognition result and whether the desired sequence was ceased or not ceased.

DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present disclosure and the advantages described herein, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIGS. 1A-1G illustrate common turn-taking patterns seen in human to human conversations;

FIG. 2 illustrates a state diagram representing a turn-taking model for a system-user turn;

FIG. 3 illustrates three zones of turn-taking yield and hold behavior in a move;

FIG. 4 illustrates a state diagram depicting the three zones of turn-taking yield and hold behavior in a move;

FIG. 5 illustrates a detailed diagram of a SALT Automatic Listen Mode;

FIG. 6 illustrates a detailed diagram of a SALT Multiple Listen Mode;

FIG. 7 illustrates a state diagram representing a turn-taking model for a system-user turn wherein HoldTimeout=0 and the onPreHoldTimeout transition is removed;

FIG. 8 illustrates a question answering device which causes restarts in the presence of noise;

FIG. 9 illustrates a turn taking state machine which is extended to perform restarts in the presence of noise;

FIG. 10 illustrates an alternative embodiment of a question answering device which causes restarts in the presence of noise; and

FIG. 11 illustrates an alternative method of calculating HoldTimeouts based on the point of speech onset.

DETAILED DESCRIPTION

In the following discussion, numerous specific details are set forth to provide a thorough understanding of the present invention. However, those skilled in the art will appreciate that the present invention may be practiced without such specific details. In other instances, well-known elements have been illustrated in schematic or block diagram form in order not to obscure the present invention in unnecessary detail. Additionally, for the most part, details concerning network communications, and the like, have been omitted inasmuch as such details are not considered necessary to obtain a complete understanding of the present invention, and are considered to be within the understanding of persons of ordinary skill in the relevant art.

The turn taking design of this disclosure attempts to model the turn-taking process more explicitly than the prior art by selecting “Hold” or “Yield” strategies based on:

a) point of interruption in the prompt; and/or

b) explicit model of turn-taking and back-off.

Machine Turns and Moves

From the perspective of this disclosure a turn is the period from when a machine starts speaking to the point where it decides that a significant user-event has occurred which needs application logic to respond to it. It is thus an autonomic state machine, responding primarily to local information, which manages the basic sharing of the speech channel between two interlocutors: in this case the machine and the user.

If the user remains silent, a machine turn can be formulated in advance to be a sequence of spoken phrases (or moves) which will be spoken by the machine in sequential order until it requires a response in order to move forwards.

An example turn would be:

Please select one of the following: news, sport and weather.

This could be considered to be made up of four Moves:

[Please select one of the following] [news] [sport] [and weather].

The selection of what constitutes a move is not mandated by this design. It is however anticipated that generally:

a) Each move will be a phrase in its own right.

b) Each move will have a pause before and after it (pauses may be very short).

It is further assumed that the point of interruption of a move by a speaker is important and will affect the model.

This design recognizes that, amongst other things, any move boundary may act as a turn-taking cue, and that move boundaries will generally coincide with phrasal boundaries.

Consider the following example:

[What would you like to do?] [2 sec pause] [You can say one of the following] [news] [sport] [and weather] [2 sec pause] [or hold for an agent]

For the purpose of clarity we treat this as a single ‘turn’, but the design is actually neutral to this linguistic distinction. The design takes as its input a sequence of moves which may be anticipated in the absence of any user response, each potentially with its own anticipated grammar, and a specified pause following each move.

Turn starts are more likely at move boundaries, especially where there is a pause between the moves. This invention adopts different turn-taking behaviors depending on the point of interruption by the user. In order to facilitate this, each machine move is broken up into three optional zones:
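The input described above can be sketched as a small data structure. This is a minimal illustration only, assuming a Python representation: the names `Move`, `prompt_text`, `grammar` and `yield_timeout_s`, the grammar label "main_menu", and the 0.2 second default pause are all assumptions made for the example and do not come from the disclosure. Only the phrases and the two 2-second pauses are taken from the example turn.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Move:
    prompt_text: str               # phrase spoken by the machine
    grammar: str                   # label of the grammar anticipated in response (placeholder)
    yield_timeout_s: float = 0.2   # pause after the move; 0.2 s is an assumed "very short" pause


# The example turn from the text, expressed as a sequence of moves.
turn: List[Move] = [
    Move("What would you like to do?", "main_menu", 2.0),
    Move("You can say one of the following", "main_menu"),
    Move("news", "main_menu"),
    Move("sport", "main_menu"),
    Move("and weather", "main_menu", 2.0),
    Move("or hold for an agent", "main_menu"),
]


def total_pause(moves: List[Move]) -> float:
    """Sum of the pauses that follow each move in an uninterrupted turn."""
    return sum(m.yield_timeout_s for m in moves)
```

Keeping the pause on the move itself mirrors the statement that each move carries its own grammar and its own following pause.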

i. Pre-Hold Zone

ii. Hold Zone

iii. Post-Hold Zone.

These zones can be seen in FIGS. 3 and 4. Each of the zones could be optionally zero length but, where present, they follow in this same ordered sequence.

The three zones can be defined by just two breakpoints, termed the PreHoldTimeout and PostHoldTimeout. This invention does not depend on the method by which these breakpoints are determined. These two timeouts could for example be derived from turn-taking likelihood functions, or simply defined by the developer.

FIG. 3 illustrates the likelihood of a turn-taking act from the user dropping as the move progresses; it then rises again as the move completes. The moves in this diagram are assumed to present strong turn-taking cues to the user. The shape of these likelihood functions is idealized and will vary depending on prosodic, semantic, pragmatic and other extra-linguistic cues which are present in the user-interface. For example, the prior knowledge of the user interface and the propensity to use barge-in as a user interface device can alter the diagram of FIG. 3. Other methods of determining turn-taking likelihoods may also be envisaged by those skilled in the art.

With reference to FIG. 3, one way to determine the Pre-Hold, Hold and Post-Hold zones and their corresponding timeouts would be to apply the two parameters shown below.

Parameter: LostFloorThreshold
Description: The threshold below which the machine turn moves from the Pre-Hold state to the Hold state as the floor is taken away from the user by the machine.
Default: 0.5

Parameter: YieldAnticipationThreshold
Description: The threshold above which the machine turn moves from the Hold state to the Post-Hold state, as the user anticipates the turn-taking boundary that is approaching.
Default: 0.5

The first breakpoint occurs when the likelihood of a user response to the previous move has fallen below a certain value (LostFloorThreshold), the second where the emerging likelihood of a user response to the current phrase rises above a certain value (YieldAnticipationThreshold).

If the function never reaches these thresholds then the Hold state never occurs. The PreHold state transitions directly into the PostHold state. This could be due to a sequence of short moves, low threshold values, or other parameters in the likelihood function model. In this circumstance, the boundary between these states is taken to be the point at which the minimum value of the function occurs (i.e. the point at which the contributions to the turn-taking likelihood from the previous move and the next move are equal). If the minimum occurs at a point with a gradient of zero (i.e. the function has a fixed minimum value over a certain time period), then the boundary is taken to be the time representing the mid-point of this fixed region.
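The threshold rule and its fallback can be sketched as follows. This is an illustrative reading rather than the disclosed implementation: it assumes the turn-taking likelihood is available as a single sampled curve over the move, and the function name `zone_breakpoints` and its argument names are hypothetical. The case where the curve falls below the first threshold but never rises above the second is not covered by the text; holding until the last sample is an assumption made here.

```python
from typing import Sequence, Tuple


def zone_breakpoints(
    times: Sequence[float],
    likelihood: Sequence[float],
    lost_floor_threshold: float = 0.5,
    yield_anticipation_threshold: float = 0.5,
) -> Tuple[float, float]:
    """Return (end of Pre-Hold zone, end of Hold zone) measured from move start."""
    if not times or len(times) != len(likelihood):
        raise ValueError("times and likelihood must be non-empty and equal length")

    # First breakpoint: likelihood falls below LostFloorThreshold.
    fall = next(
        (i for i, p in enumerate(likelihood) if p < lost_floor_threshold), None
    )
    if fall is None:
        # No Hold zone: use the minimum, or the mid-point of a flat minimum.
        lowest = min(likelihood)
        first = list(likelihood).index(lowest)
        last = first
        while last + 1 < len(likelihood) and likelihood[last + 1] == lowest:
            last += 1
        boundary = (times[first] + times[last]) / 2.0
        return boundary, boundary

    # Second breakpoint: likelihood rises above YieldAnticipationThreshold.
    rise = next(
        (
            j
            for j in range(fall + 1, len(likelihood))
            if likelihood[j] > yield_anticipation_threshold
        ),
        None,
    )
    # Assumption: if it never rises again, hold until the last sample.
    end_of_hold = times[-1] if rise is None else times[rise]
    return times[fall], end_of_hold
```

When both returned values are equal the Hold zone has zero length, which corresponds to the PreHold state passing directly into the PostHold state.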

FIG. 4 illustrates the progression through the three zones as a state machine where state transition points are defined by the timeouts described above, and the length of the prompt.
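A minimal sketch of this zone progression is given below. It assumes that both breakpoints are measured in seconds from the start of the move prompt, and the names `Zone` and `zone_at` are hypothetical. The `COMPLETE` value is added only to mark that the prompt has finished and is not one of the three zones.

```python
from enum import Enum


class Zone(Enum):
    PRE_HOLD = "PreHold"
    HOLD = "Hold"
    POST_HOLD = "PostHold"
    COMPLETE = "Complete"   # prompt finished; not one of the three zones


def zone_at(
    elapsed_s: float,
    pre_hold_timeout_s: float,
    post_hold_timeout_s: float,
    prompt_length_s: float,
) -> Zone:
    """Classify a playback position within a move into one of the zones."""
    # Clamp so the zones always appear in the order PreHold, Hold, PostHold.
    pre = min(max(pre_hold_timeout_s, 0.0), prompt_length_s)
    post = min(max(post_hold_timeout_s, pre), prompt_length_s)

    if elapsed_s >= prompt_length_s:
        return Zone.COMPLETE
    if elapsed_s < pre:
        return Zone.PRE_HOLD
    if elapsed_s < post:
        return Zone.HOLD
    return Zone.POST_HOLD
```

Setting both breakpoints to zero leaves only a Post-Hold zone, and setting them equal removes the Hold zone, which matches the statement that any zone may be of zero length.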

Pre-Hold Zone.

An interruption during the Pre-Hold zone occurs prior to the key information in the prompt being heard by the listener. It is therefore likely to actually be a late response to the previous move. In this region the machine yields to this input and cuts the current prompt delivery.

Hold Zone.

In this zone, the likelihood of a turn-taking act by the user is lessened considerably (according to the usual protocols of conversational turn-taking); however it is still possible. Interruption in this zone is likely to be a very late response to previous moves, or an anticipated response to the current move if the user is already familiar with the dialogue model. In this region the machine will hold the floor. This does not however mean that it is ignoring the user input. In applications where users are familiar with the domain and/or choose to use the explicit barge-in style, the hold zone may be of zero length. This could happen dynamically in response to an explicit model of user turn-taking likelihood.

Post-Hold Zone.

In the Post-Hold zone, the key information in the current move is likely to have been perceived by the user. Interruption in this zone is likely to be a normal anticipation of the end of the move. If key information is revealed early in the move (a common design practice that may indicate that there are multiple actual moves in the ‘move’) then the Post-Hold zone may actually be quite long. In this region it is assumed that the machine will yield to interruptions. This yield may even go unnoticed if it is sufficiently near to the end of the move.

These three zones closely emulate human communication behavior. Where they are used in conjunction with user interface designs containing relatively short moves they will result in stable user interfaces which are intuitively accessible to users.

In this disclosure the interruptions which are initiated in the three different zones result in either ‘yield’ or ‘hold’ behavior. Through the choice of various parameters the generalized turn-taking engine described below can deliver a continuum of behaviour from ‘immediate yield’ through to ‘always hold’. The yield zones use parameter sets which result in relatively rapid yield, and the hold zone uses different parameters which result in the behaviour of holding the floor.

Example of Hold Zones

An uncontested Turn can therefore be viewed as a sequence of the following zones:

Move 1. Pre-Yield

Move 1. Hold

Move 1. Post-Yield

Move 1. Yield Timeout

Move 2. Pre-Yield

Move 2. Hold

etc.

Move N. Pre-Yield

Move N. Hold

Move N. Post-Yield

Move N. Yield Timeout

(Turn ends with ‘silence’ on no response).

The YieldTimeout defines the pause following each move. The YieldTimeout of the final move will be considerably longer in order to fully give the turn away at that point. Recall that moves can optionally omit PreHold, Hold or PostHold zones by setting the appropriate timeouts, and that the YieldTimeout can be set to zero.

The user may of course also be making turns and moves in a similar fashion to the machine. With current technology the machine unfortunately has access to much less information regarding the user turn.

This design can utilize the speech application language tags (“SALT”) model. This is an event based model where listen and prompt are independent threads, giving the designer the widest range of options yet for building turn-taking models. The SALT model is commonly known in the art. Other similar models could be used. It is also anticipated that speech technology vendors will develop better ways of detecting user phrase boundaries, disfluent re-starts, and yielding behavior. Should this happen then the current design will be able to make use of this extra information.

The SALT model is a standard which is close to the state of the art regarding what machines can know about user turns as perceived by a speech recognizer. The SALT model comprises independent <Prompt> and <Listen> threads which can be started with Start( ), paused with Pause( ) or stopped with Stop( ). Prompts and Listens throw events as they execute. It is the designer's role to catch these events and co-ordinate the interaction between <Prompt> and <Listen>.
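The zone sequence of an uncontested turn, including the omission of zero-length zones, can be sketched as below. The dictionary keys (`prompt_length`, `pre_hold_timeout`, `post_hold_timeout`, `yield_timeout`) and the function name `uncontested_zones` are assumptions for this illustration; both breakpoints are taken to be measured in seconds from the start of the move prompt.

```python
from typing import Dict, List, Tuple


def uncontested_zones(moves: List[Dict[str, float]]) -> List[Tuple[int, str, float]]:
    """List (move number, zone name, duration) for a turn with no user input."""
    sequence = []
    for number, move in enumerate(moves, start=1):
        length = move["prompt_length"]
        # Clamp the breakpoints so the zones stay ordered and inside the prompt.
        pre = min(move["pre_hold_timeout"], length)
        post = min(max(move["post_hold_timeout"], pre), length)
        spans = [
            ("Pre-Yield", pre),
            ("Hold", post - pre),
            ("Post-Yield", length - post),
            ("Yield Timeout", move["yield_timeout"]),
        ]
        # Zones of zero duration are omitted, as the text allows.
        sequence.extend((number, name, d) for name, d in spans if d > 0)
    return sequence
```

For example, a move whose two breakpoints coincide contributes no Hold entry, and a move with a YieldTimeout of zero contributes no pause entry.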

There are three listen modes described by SALT: Single, Automatic, and Multiple. The current design anticipates that the recognition will use Multiple mode (FIG. 6), although Multiple mode can be emulated using Automatic mode (FIG. 5) by restarting the recognizer whenever onSilence (silence detected), onReco (recognizable speech detected), or onNoReco (un-recognizable speech detected) events are received. Single mode is designed for multi-modal push-to-talk applications which do not have conversational turn-taking, and therefore it is not relevant to this disclosure.

In automatic mode the recognizer is running continuously unless explicitly stopped. It throws events when speech matching the grammar (onReco) or speech not matching the grammar (onNoReco) has been detected. It also throws events whenever the start of an utterance is detected.

In one embodiment, FIG. 2 depicts a primary state machine that is used to model the turn-taking behavior. A simple auxiliary state machine can be used to keep track of the PreHold, Hold, and PostHold states of move playback as described above. As shown in FIG. 2, these states are:

1) System has floor

2) User grabbing floor

3) System backed-off

4) Both backed-off

5) User has floor

6) Both yielded

An additional state can occur if the machine interrupted the user deliberately. FIG. 2 shows the operation of a turn engine where this state is omitted.

Event Model

The state machine responds to the following events:

Event               Definition
onPromptComplete    SALT:prompt.onComplete
onSpeechDetected    SALT:listen.onSpeechDetected
onConfidentReco     SALT:listen.onReco && NOT (Match(YieldGrammar))
onYieldReco         SALT:listen.onReco && Match(YieldGrammar)
onNoReco            SALT:listen.onNoReco
onPreHoldTimeout    Thrown by PreHoldTimer.
onHoldTimeout       Thrown by HoldTimer.
onBackoffTimeout    Thrown by BackoffTimer.
onRestartTimeout    Thrown by RestartTimer.
onYieldTimeout      Thrown by YieldTimer.

Most events are self-explanatory with the exception of onConfidentReco and onYieldReco. This design allows designers to explicitly model partial yielded responses in the speech recognition grammar (represented by YieldGrammar). Successful matches against this grammar are taken to be indications that the user has yielded the floor by giving an incomplete utterance. The onYieldReco event can be replaced by a better classifier for detecting back-off at a later date with no change to the current design. The onConfidentReco event merely reflects the SALT onReco event, excluding this YieldGrammar. Most current speech recognizers are unable to detect floor yield specifically. Thus this invention uses ‘low confidence recognition’ as a term to refer to YieldReco as well as NoReco.

It should be noted that any set of events performing the same function as the above could replace the above definitions. For example, better models to distinguish confident results from rejected results could be substituted. It is not the purpose of this design to describe optimal classifiers of speech events. By way of example, the classification of onReco and onNoReco could be enhanced using a confidence measure based on turn-taking likelihood as well as acoustic likelihood. Note also that the YieldGrammar performs the function of modeling spontaneous restarts in user speech within the turn taking state engine.
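The mapping from recognizer results to turn-engine events can be sketched as follows. This is a simplified illustration, not the SALT API: the YieldGrammar is modeled here as a plain set of partial phrases, which is an assumption, and the function names `classify` and `is_low_confidence` are hypothetical. Only the event names come from the definitions above.

```python
from typing import Optional, Set


def classify(salt_event: str, text: Optional[str], yield_grammar: Set[str]) -> str:
    """Map a listen event and its recognized text to a turn-engine event."""
    if salt_event == "onNoReco":
        return "onNoReco"
    if salt_event == "onReco":
        # A match against the YieldGrammar means an incomplete, yielded utterance.
        matched = text is not None and text.strip().lower() in yield_grammar
        return "onYieldReco" if matched else "onConfidentReco"
    raise ValueError(f"unexpected listen event: {salt_event}")


def is_low_confidence(event: str) -> bool:
    """'Low confidence recognition' covers both YieldReco and NoReco."""
    return event in {"onYieldReco", "onNoReco"}
```

Because the classification is isolated in one function, a better back-off classifier could be substituted without changing the events seen by the state machine.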

In some embodiments the machine responds differently to events depending on the state it is in when the results are returned. Most notably it can do the following:

Considering FIG. 2, transitions between the states are denoted by arrows from one state to another. These transitions have trigger conditions and actions associated with them. These trigger conditions and actions are shown in boxes attached to the transition arrows. Transition trigger conditions are shown in the first box in ordinary type. These comprise boolean combinations of events and boolean functions based on parameters associated with the turn engine. When an event is thrown, the relevant transitions from the current state that contain that event in their trigger conditions are evaluated. If there are additional boolean guard criteria on the trigger condition these are also evaluated. If a transition trigger condition evaluates to the boolean value ‘true’ then the transition is triggered. Once a transition is triggered, the actions associated with that trigger are executed. These are shown in bold type in the figure. Some transitions have no actions associated with them and some have more than one action associated with them.
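The evaluation rule described above (event match, then guard, then actions) can be sketched with a small table-driven engine. The class names `Transition` and `TurnEngine`, the state labels, and the three-row table are assumptions for illustration; the table is a deliberately tiny subset and does not reproduce FIG. 2. Actions only record their names in a log rather than driving real prompts or timers.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Params = Dict[str, Any]


def always(params: Params) -> bool:
    return True


@dataclass
class Transition:
    source: str
    event: str
    target: str
    guard: Callable[[Params], bool] = always
    actions: List[Callable[[Params], None]] = field(default_factory=list)


class TurnEngine:
    def __init__(self, initial: str, transitions: List[Transition], params: Params):
        self.state = initial
        self.transitions = transitions
        self.params = params

    def fire(self, event: str) -> bool:
        """Evaluate transitions from the current state; return True if one fired."""
        for t in self.transitions:
            if t.source == self.state and t.event == event and t.guard(self.params):
                for action in t.actions:
                    action(self.params)
                self.state = t.target
                return True
        return False


def more_moves(p: Params) -> bool:
    return p["n"] + 1 < p["num_moves"]


def start_yield_timer(p: Params) -> None:
    p["log"].append(f"StartYieldTimer({p['n']})")


def next_prompt(p: Params) -> None:
    p["n"] += 1
    p["log"].append(f"StartPrompt({p['n']})")


# Illustrative subset only; the full table of FIG. 2 is not reproduced here.
TABLE = [
    Transition("SystemHasFloor", "onPromptComplete", "BothYielded",
               actions=[start_yield_timer]),
    Transition("BothYielded", "onYieldTimeout", "SystemHasFloor",
               guard=more_moves, actions=[next_prompt]),
    Transition("BothYielded", "onSpeechDetected", "UserHasFloor"),
]
```

An event that matches no transition from the current state, or whose guard is false, leaves the state unchanged.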

Timers

Timer objects are simple objects which run in their own threads. They are started with a timeout value. Timeout values are associated with each move. They may vary from move to move depending on the move function and content. Timers are started with Timer.Start(timeout-value), paused with Timer.Pause( ) and stopped with Timer.Stop( ) in a manner directly analogous to the starting, pausing and stopping of SALT objects. Once they reach their timeout values they throw the corresponding onTimeout event. A paused timer is continued with a subsequent Start( ) operation. Timers are reset using the Stop( ) operation.
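The Start, Pause and Stop semantics can be sketched with a timer driven by an explicit clock. This is a simplification for illustration: the text describes timers running in their own threads, whereas the hypothetical `advance` method below replaces the thread so that the behaviour is deterministic. The class layout and the callback signature are likewise assumptions.

```python
from typing import Callable, Optional


class Timer:
    def __init__(self, event_name: str, on_timeout: Callable[[str], None]):
        self.event_name = event_name
        self.on_timeout = on_timeout
        self.timeout: Optional[float] = None
        self.elapsed = 0.0
        self.state = "stopped"   # "stopped", "running" or "paused"

    def start(self, timeout: Optional[float] = None) -> None:
        if self.state == "paused":
            # A paused timer is continued by a subsequent start.
            self.state = "running"
            return
        if timeout is None:
            raise ValueError("a stopped timer needs a timeout value")
        self.timeout = timeout
        self.elapsed = 0.0
        self.state = "running"

    def pause(self) -> None:
        if self.state == "running":
            self.state = "paused"

    def stop(self) -> None:
        # Stop resets the timer.
        self.state = "stopped"
        self.elapsed = 0.0

    def advance(self, dt: float) -> None:
        """Simulated passage of time; stands in for the timer thread."""
        if self.state != "running":
            return
        self.elapsed += dt
        if self.elapsed >= self.timeout:
            self.stop()
            self.on_timeout(self.event_name)
```

A usage pattern would be to create one timer per event name, for example `Timer("onYieldTimeout", handler)`, and route the callback into the turn engine.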

Move Model

The turn taking model assumes that moves are represented by an array of Moves, denoted M[n]. Each move has a prompt, denoted M[n].prompt, and a listen element, denoted M[n].reco. In this model each move models a single phrase, and the associated reco object is pre-loaded with the appropriate grammar in response to that phrase and surrounding context.

In this design sequential moves are represented using a sequence of SALT prompt objects, one for each move. An alternative approach would be to use a single SALT prompt object for the whole turn. The moves would then simply be part of this prompt and have embedded silence between them to represent the YieldTimeout for each move. Bookmarks may be thrown at the start and end of moves to synchronize the state machine with the move boundaries.

Note that the reco for a move is started AFTER the prompt for a given move: the reco for the previous move is still listening as the next prompt starts. This implements turn-overlapping in a straightforward manner, although there are other ways to implement the same feature.

Transition Actions

Actions are expressed as functions as shown below:

Action                   Definition
StartPrompt(n)           M[n].Prompt.Start( )
                         PreHoldTimer.Start(M[n].PreHoldTimeout)
                         T[n].Start( )
StartPromptX(n)          M[n].Prompt.Start( )
                         T[n].Start( )
PausePrompt(n)           M[n].Prompt.Pause( )
                         PreHoldTimer.Pause( )
                         T.Pause( )
StopPrompt(n)            M[n].Prompt.Stop( )
                         PreHoldTimer.Stop( )
StopReco(n)              M[n].Reco.Stop( )
StartReco(n)             M[n].Reco.Start( )
StartYieldTimer(n)       PreHoldTimer.Stop( )
                         yieldTimer.Start(M[n].YieldTimeout)
StartYieldTimerNbi(n)    If(NOT BargeIn) { yieldTimer.Start(M[n].YieldTimeout) }
StartHoldTimer(n)        holdTimer.Stop( )
                         holdTimer.Start(M[n].GetHoldTimeout(T))
StartBackoffTimer(n)     backoffTimer.Start(M[n].BackoffTimeout)
StartRestartTimer(n)     restartTimer.Start(M[n].RestartTimeout)

The turn engine starts by setting the move index (n) to zero and then playing the prompt associated with this first move (100).

Timeouts and their Effect

Move Timer (T)

The timer value (T) denotes the reference timestamp of the current place in the playback of the move prompt. This timer is started when a move prompt is started, paused when the prompt is paused, but not stopped when the prompt is stopped.

This timer drives the transition through the different zone states shown in FIG. 4. Multiple concurrent move timers can exist, one for each move. These timers can be left to run until the end of the dialog. This is only a necessary feature if a Turn Confidence function is used in order to calculate the move zones. In this alternative form the turn taking onset likelihood is calculated as the sum of the contributions of previous moves as well as the current move. More details on onset likelihood are provided by commonly owned, co-pending patent application “Turn-Taking Confidence” by Attwater, Ser. No. ______, filed on Dec. 22, 2005. For practical reasons these timers can be stopped after an appropriate period of time, for example at the end of the turn, or after a number of further moves have elapsed. The MoveTimer does not directly trigger any transitions in the state machine of FIG. 2.

Yield Timeout

This is the simplest of all the timeouts. It defines the pause followinga move once it has completed. It is analogous to the ‘InitialTimeout’ ofthe SALT listen element. The state both-yielded (6) is entered at theend of the prompt for each move when onPromptComplete is thrown (101),and held until the YieldTimeout completes or speech is detected. Ifspeech is detected during this wait (102), the user has taken the floor,and user-has-floor state is entered (5). If the timeout completes, thenext move prompt is started by incrementing the move counter n, andstarting the prompt associated with the next move denoted by n (103).Alternatively the turn completes if there are no waiting moves (104).

This transition (104) represents one way to end the turn state engine. Note that in this design the recognizer is not stopped at the end of the turn, allowing its operation to overlap with the beginning of the next turn by re-entering the state machine once more. The PreHoldTimeout of the first move of the next turn controls this overlap (105). In this way seamless overlaps between turns are achieved by this design.

PreHoldTimeout

This timeout represents the time from the start of the prompt associated with a move up to the point where the PreHold zone ends. Whilst in the System-Has-Floor state (1), this timeout causes the Reco object for the previous move to be stopped, and the one associated with the current move to be started, implementing the overlap mechanism described above (105). It has already been mentioned that each move has its own associated grammar. In many cases this grammar may be the same for each move, but there may be reasons to change the grammar between moves, for example if the next move is likely to significantly alter the user's expectations of what to say next.

In one embodiment, the recognizer is not stopped at the end of the move, allowing its operation to overlap with the beginning of the next move. The PreHoldTimeout can define the point at which the recognizer will be re-initialized with the new grammar. During the PreHold zone the caller is judged to be more likely to be responding to the previous move rather than the currently evolving one.

Indeed, if the turn eventually completes successfully as a result of an interruption prior to the PreHoldTimeout, then the result must have matched the grammar for the previous move, not the currently evolving one. This can occur if the state machine completes via transition (112) or transition (118) in FIG. 2.

The eventual rejection of an utterance which starts prior to the PreHoldTimeout completing can be treated similarly. This can occur if the state machine completes the turn via transition (117) or transition (119) in FIG. 2. Such a rejection will represent the failure of the previous turn. Under such circumstances it would be sensible for the dialog design to return to the previous dialog state or enter error correction associated with the previous dialog state. This is a desirable feature but not an essential feature of the invention. If all of the moves are associated with the same grammar then this feature is not required.

The PreHoldTimer is Started( ), Paused( ) and Stopped( ) in concert with the prompt for the current move via the functions StartPrompt( ), PausePrompt( ) and StopPrompt( ). The value of the PreHoldTimeout should not exceed the length of its associated move prompt.

HoldTimeout

The Hold Timeout denotes how long the system holds the floor after detecting an interruption by the user until it actually cuts the prompt. It is started when speech is detected in the System-Has-Floor state (106), which causes the transition to the User-Grabbing-Floor state (2). This is a way of extending the delay which may already occur between the actual onset of user speech or noise, and the point where the recognizer reports this as onSpeechDetected. In some embodiments it is preferable for this integral delay to be relatively short.

Cutting the prompt is a turn-yielding action and changes the state of the turn-taking protocol. The Hold Timeout therefore lets the system wait a while. In the absence of any other events the completion of this timeout period causes the current move to be paused (i.e. stopped, but in such a way that it could be continued later if necessary), and the System-Backed-Off state is entered (107). This is only useful if the recognizer can report useful information in this timeout period. If results could be polled, it might be possible to see how the recognition was evolving in this period (this would emulate what some recognizers already do in this regard). In the SALT model, however, the system can wait to see whether an OOG (out-of-grammar) result or a very rapid recognition is returned. Brief noises may be ignored as a result.

If the move completes while the user is grabbing the floor then the user is automatically granted the floor and the state User-Has-Floor is entered (110). This timeout is varied according to when the prompt is interrupted, and is the primary mechanism for implementing the hold or yield strategies in the different move zones. Even in the PreHold and PostHold zones non-zero values may be desirable.

The HoldTimeout should not be confused with the PreHoldTimeout. The function GetHoldTimeout(T) embodies the zone-dependent behavior described above. In some embodiments this function can return the timeout values according to which zone the interruption occurred in, as follows:

    int GetHoldTimeout(int T) {
        Case GetZone(T):
            (Pre Hold):  HoldTimeout = 200ms; break;
            (Hold):      HoldTimeout = infinity; break;
            (Post Hold): HoldTimeout = 300ms; break;
        Return HoldTimeout;
    }

The non-zero values in the Pre and Post Hold regions can be used to damp the recognizer response to short noises in these regions.

FIG. 4 shows a possible definition for the function GetZone(T). The states PreHoldPause, HoldPaused and Post Hold Paused represent the case where the prompt has been paused whilst in the corresponding zone.
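As an illustration of how GetZone(T) and GetHoldTimeout(T) could fit together, the following sketch maps a playback position to a zone and then to a hold timeout. The 200 ms, infinite and 300 ms values are the example values given in the text; the two boundary parameters (`pre_hold_timeout_ms` and `hold_zone_end_ms`) and the function signatures are assumptions made for this sketch, since FIG. 4 is not reproduced here.

```python
import math

# Example per-zone hold timeouts in milliseconds, as given in the text.
HOLD_TIMEOUT_MS = {"PreHold": 200, "Hold": math.inf, "PostHold": 300}


def get_zone(t_ms, pre_hold_timeout_ms, hold_zone_end_ms):
    """Classify a playback position into a move zone (assumed boundaries)."""
    if t_ms < pre_hold_timeout_ms:
        return "PreHold"
    if t_ms < hold_zone_end_ms:
        return "Hold"
    return "PostHold"


def get_hold_timeout(t_ms, pre_hold_timeout_ms, hold_zone_end_ms):
    """Return the hold timeout for an interruption at playback position t_ms."""
    return HOLD_TIMEOUT_MS[get_zone(t_ms, pre_hold_timeout_ms, hold_zone_end_ms)]
```

With a PreHold boundary at 500 ms and a Hold boundary at 2000 ms, an interruption at 1000 ms falls in the Hold zone and the prompt is never cut by the hold timer alone.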

In the User-Grabbing-Floor state (2), whose duration is defined by the HoldTimeout, noises or partial yields are ignored and the system returns to the System-Has-Floor state (108). Confident responses which complete before the timeout are also apparently ignored from the user perspective (109). The results from such confident responses can be remembered and used to boost the confidence of subsequent recognitions of the same utterance. Such behavior is less desirable in the Pre and Post Hold zones, hence the timeout values for these zones are selected to be shorter than the shortest anticipated word. They can be increased if long utterances are anticipated. In an alternative embodiment, confident results can be accepted in the state User-Grabbing-Floor and complete the state machine in a manner analogous to transitions from the Both-Backed-Off state (112) and the User-Has-Floor state (118).

Those skilled in the art could conceive of many different definitions of the function GetHoldTimeout. Each definition would define the machine's ‘attitude’ towards interruptions and floor holding. FIG. 11 shows an example of how, based on the point of speech or noise onset, a continuous function could be used to derive the Hold Timeout. In the example, the Hold Timeout rapidly increases from a minimum value (100 milliseconds) to a finite maximum value (10 seconds) at the boundary of the Hold zone. The timeout then falls steadily (in the logarithmic domain) through the Hold and Post Hold zones. The use of finite values means that the turn machine operates upon a continuum between floor yielding and floor holding policies rather than the abrupt transition between floor holding and floor yielding described above.
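One possible shape for such a continuous function is sketched below. It is not a reconstruction of FIG. 11: the ramp duration, the choice to reach the minimum again at the end of the prompt, and all parameter names are assumptions made for illustration. Only the 100 ms minimum, the 10 second maximum and the linear fall in the logarithmic domain come from the text.

```python
import math


def hold_timeout_ms(t, hold_start, prompt_end, ramp=0.2,
                    t_min=100.0, t_max=10_000.0):
    """Hold timeout (ms) for an onset at playback time t (seconds).

    Rises over `ramp` seconds to t_max at the start of the Hold zone,
    then falls linearly in log10 space back to t_min at prompt_end.
    """
    log_min, log_max = math.log10(t_min), math.log10(t_max)
    if t <= hold_start - ramp or t >= prompt_end:
        return t_min
    if t < hold_start:
        # Rapid rise just before the Hold zone boundary.
        frac = (t - (hold_start - ramp)) / ramp
    else:
        # Steady fall through the Hold and Post Hold zones.
        frac = 1.0 - (t - hold_start) / (prompt_end - hold_start)
    return 10 ** (log_min + frac * (log_max - log_min))
```

Because every value is finite, an interruption anywhere in the prompt will eventually cut it; the function only controls how long the machine persists first.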

The HoldTimeout could alternatively be selected such that it causes the prompt to be paused at the end of the next spoken word. Such behavior would be made possible by making use of the timing information obtained by forced alignment of the prompt text and a recorded utterance, or obtained directly from a speech synthesizer.

Back Off Timeout

The Back Off Timeout begins as the system yields the floor to an interruption following the Hold Timeout (107). This timeout represents the perceptual period following back-off in which the user will not perceive that the turn has been unequivocally given away by the system. It defines the maximum period in which the state machine will remain in the System-Backed-Off state (3). Note that the machine prompt is paused, not stopped, following this timeout (107). If, during this short back-off period just after the machine has fallen silent, the end of user speech in the form of an uncertain recognition or yield is detected, the turn-taking engine will assume that the user has backed off and proceed to the both-backed-off state (111). Confident recognition during the back-off state, however, is taken at face value and the turn completes; more accurately, the state machine completes to let the parent state machine decide on the next course of action (112).

If no recognition result occurs during the back-off period then, on completion of the BackOffTimeout, the user is still talking and is now assumed to be clearly holding the floor. The state machine progresses to the user-has-floor state (113). A default value of less than 200 ms is suggested for this timeout.
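The three outcomes of the System-Backed-Off state can be summarized in a short sketch. The function name and the event names `onYield` and `onBackoffTimeout` are assumptions chosen for illustration; the transition numbers in the comments follow the text.

```python
def on_system_backed_off_event(event):
    """Return the next state for an event seen in the System-Backed-Off state."""
    if event == "onConfidentReco":
        # Transition 112: confident result is taken at face value.
        return "END"
    if event in ("onNoReco", "onYield"):
        # Transition 111: the user appears to have backed off as well.
        return "BOTH-BACKED-OFF"
    if event == "onBackoffTimeout":
        # Transition 113: the user is still talking and holds the floor.
        return "USER-HAS-FLOOR"
    # Any other event leaves the state unchanged in this sketch.
    return "SYSTEM-BACKED-OFF"
```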

RestartTimeout

Once the Both-Backed-Off state (4) is entered, the Restart Timer is started (111). In this state the user has fallen silent and the machine is also silent. An impasse has occurred. The Restart Timer can define the duration which the machine is prepared to wait to see whether the user will restart their utterance spontaneously. If the user restarts during this time period then the current move is considered complete, and the user is unequivocally given the floor, moving to the User-Has-Floor state (114).

This timeout can be set to a relatively long period, as it is analogous to the usual recognition silence timeout. A value of between 0 and 1000 ms is suggested, but it could be longer. Changing this value will affect the emergent property of whether the machine is eager to grab turns back. This will affect the perceived personality of the machine, and this timeout may be set according to design criteria. If the user does not speak before the timeout is reached, then one of three transitions occurs:

1. If the interruption point was early in the prompt (defined by the logical NOT of the boolean function IsLateInterruption), and a policy of Restarting on Backoff (defined by the boolean function RestartOnBackoff) is in place, then the move can be re-started. This will result in the apparent effect of a disfluent restart of the whole turn on behalf of the machine in response to external interruptions (115). Disfluent re-starts can be perceived by users as inappropriate behavior and often not recalled after the interaction.

2. If the same condition as above occurs and RestartOnBackoff( ) is false, then the machine prompt can be resumed, i.e. it continues from where it was paused (116). This behavior is only appropriate if the hold timeout and restart timeout have low values.

3. If the interruption point was late in the prompt, the turn can be completed. The user has chosen not to speak, and the machine must decide what to do next (117).

A possible enhancement to the first step of re-starting a move (115) could be to restart from the start of the previous move boundary instead of the start of the turn. In this case the value of n would not be reset to zero. A further alternative would be to modify the form of the repeated turn or move. A subtly different prompt could be used with the same meaning, which could prevent a mechanical sounding effect on repetition of previous prompts. In addition or as an alternative, a re-start signal phrase such as ‘Sorry!’ may be interjected to signal the re-start. This is a behavior often seen in human language during disfluent re-starts.

A possible enhancement to the second step (116) would be to continue the prompt at the point where it would have reached had it continued to play, to reduce the perceptual impact of the interruption. Another similar approach would be to begin to modulate the volume of the prompt down to a low value on transition to the System-Backed-Off state. This modulation could follow some kind of amplitude envelope designed to emulate the effect of a speaker cutting off an utterance in mid vocalization. The volume would be modulated up again on return to the System-Has-Floor state. A further enhancement would be to back-track to a previous syllable boundary in the current move, if the syllable boundaries of the current utterance are known to the system.

The IsLateInterruption function can be defined in a number of ways. One embodiment would return a false value if the interruption occurred during the PreHold zone but true otherwise. This means that the function would return true in the Hold zone and Post Hold zone. Where the HoldTimeout is infinite in the Hold zone, a late interruption in that zone cannot actually occur under normal circumstances. This definition anticipates the case where the HoldTimeout could also have a finite value in the Hold zone. In an alternative embodiment, this function could always return a false value. In this case transition 117 would never occur and the decision whether to restart the prompt or not would be independent of the point of the interruption.
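The choice among the three restart-timeout transitions can be sketched as a simple decision function. The sketch takes IsLateInterruption to be true in the Hold and Post Hold zones, as stated above; the function names, the zone labels and the returned action labels are assumptions introduced for illustration.

```python
def is_late_interruption(zone):
    """True when the interruption fell in the Hold or Post Hold zone."""
    return zone != "PreHold"


def on_restart_timeout(zone, restart_on_backoff):
    """Pick the action taken when the Restart Timer expires in Both-Backed-Off."""
    if is_late_interruption(zone):
        # Transition 117: the turn completes; the application decides what is next.
        return "complete_turn"
    if restart_on_backoff:
        # Transition 115: disfluent restart of the move.
        return "restart_move"
    # Transition 116: continue the prompt from where it was paused.
    return "resume_prompt"
```

The alternative embodiment in which IsLateInterruption always returns false would correspond to replacing the body of `is_late_interruption` with `return False`, after which `complete_turn` is never selected.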

In a further embodiment of the present invention, the system could modulate the timeout parameters according to a turn confidence function. The turn confidence function factors in the confidence applied to a user input based upon the timing of the user input.

Recognition Returns when User has Floor

In this case confident or non-confident recognition will cause the turn to end. It is up to the application to decide what to do next (118) (119). In the unlikely event that an onYield event occurs under these circumstances, the user can be given the chance for a re-start as described above (120).

Simplifying the Model for Non-Salt Applications

This invention is made possible by the fine-grained control that the SALT model provides for the Prompt and Reco SALT elements. On most of the current day speech application platforms the developer does not have this level of control. More specifically, where the barge-in model is fixed, control over when the prompt is cut relative to the interruption is pre-defined regardless of the point of interruption. Assuming that barge-in is enabled on the platform, the recognizer is started at the start of the turn. If barge-in is not enabled, the recognizer is started at the completion of the prompt.

Most current combinations of speech recognizer and telephony platform only support ‘AutomaticMode’. This invention can be implemented using this mode by restarting the recognizer explicitly when it returns a result as described above. With such systems the User-Grabbing-Floor state (2) cannot be supported, and the onPreHoldTimeout transition (105) becomes redundant. FIG. 7 shows the effect of reducing FIG. 2 under these limitations. Transitions (106) and (107) are merged to form transition (606). The resulting engine has lost a couple of features. First, it loses the power to overlap prompting and recognition between moves. The actions in transitions (100), (103), (115) and (116) are altered accordingly to give those shown in transitions (600), (603), (615), and (616). There is still the opportunity to change the grammar at the boundaries between the moves. In many cases however this will not be required. Second, the prompt cut-off behavior is the same regardless of the timing of an interruption. Several features remain however.

The BackoffTimeout can still be implemented. This means that potential user back-offs can be detected, and the engine can decide whether to restart the move or turn again as described above. Recall that this is equivalent to a disfluent restart by the machine. It can be seen that this feature is independent of the manner in which the prompt was ceased following the detection of a user interruption. Note also that the IsLateInterruption function can still be defined in such a way that the decision to restart the prompt can be dependent on the point of interruption of the move.

The RestartTimeout can also still be implemented. This means that users who do back off can still be given the opportunity to re-start their backed-off utterance.

Noise Stabilization by Modifying the Floor Holding Properties of Whole Turns

In a further embodiment of the invention there are circumstances in which the floor-holding policy may need to be adjusted depending on the evolving history of the call, for example in the suspected presence of noise. FIG. 8 shows a simple state machine for a question-asking device which exemplifies this approach. The device comprises three turn state machines. There is an initial question, ‘Question1’ (714), a repeat question, ‘Repeat’ (715), and a follow-on question, ‘Question2’ (716). Each of these turn state machines can comprise the state machine shown in FIG. 2 with two additional features. The first additional feature of the turn state machines is a Boolean parameter called ‘BargeIn’ which is associated with the state machine as a whole. This parameter modifies the behavior of the state machine as follows. Where this parameter is true, the turn state machine can yield the floor in the PreHold and PostHold zones as already described. Where this parameter is false, the turn state machine can hold the floor in all three zones (i.e. the HoldTimeout is set to infinity). Those skilled in the art will recognize that this is analogous to the current widely adopted practice where a Boolean ‘BargeIn’ parameter controls whether a prompt is interruptible or not. Technically speaking the turn engine implements a ‘Yield When Confident’ policy when the HoldTimeouts are all set to infinity: the recognizer never stops listening and will cut the prompt if a confident result is received.

The question-answering device shown in FIG. 8 could be built using any prompt and recognition state machine that supports controllable BargeIn. Almost all current speech platforms fall into this category. The second additional feature of the turn state machine is an extra counter simply named ‘RestartCount’ which is used to keep a record of the number of times that the turn state machine is re-entered.

The question answering device shown in FIG. 8 starts by setting Question1.RestartCount to zero and setting Question1.BargeIn to true (701). The turn state machine Question1 (714) is then entered. This state machine executes its own internal states, playing the prompt (e.g. ‘What medical service do you want?’) and performing recognition using a relevant grammar until it reaches the end state. On reaching the end state of FIG. 2, events are thrown depending on how the end state was reached, namely:

onConfidentReco—via transitions (112) and (118)

onNoReco—via transitions (117) and (119)

onTurnComplete—via transition (104)

In the case of sustained background noise, the most likely completion event will be onNoReco via transition 119 with the state USER-GRABBING-FLOOR in the turn state machine state history list. The state machine of FIG. 8 catches these events and uses them to control the higher-level flow of the question device.

The onConfidentReco event is further qualified by an additional Boolean parameter ‘Inter-word’. The onConfidentReco(NOT Inter-word) event (707) indicates that a confident speech recognition task has been completed and that a differentiated high-scoring recognition result has been returned by the speech recognizer. In this case the question answering device can complete, throwing an onSuccess event status (717).

The onConfidentReco(Inter-word) event (706) also indicates that a confident speech recognition task has been completed, but there are a number of close-scoring items in the candidate list from the speech recognizer. In this case the device simply asks the caller to repeat the answer using another turn engine, Repeat (715), with the prompt ‘Once again?’. Given that a successful recognition has occurred there is no reason to suppose that there is a particularly noisy environment, so Repeat.BargeIn is set to true (719). A confident result from this turn (708) leads to successful completion of the question device (717). For simplicity of this description it is assumed that in the event of an Inter-word status from this turn via (708) the result will be compared with the result from the first recognition, and the best scoring candidate based on the two uncertain results is chosen. The precise nature of this comparison is not relevant to this invention and those skilled in the art will be aware of various methods to achieve this. An onNoReco status (710) can lead to failure of the question device as a whole (718).

Should the initial Question1 turn (714) return with an onNoReco event, then one of two different state transitions may occur. A Boolean function NoisyInterruptionRestart( ) is used to determine which of these two transitions occurs. In the case where this function is true (704) the first question turn is started again. However, prior to starting the turn again, Question1.BargeIn is set to false and Question1.RestartCount is incremented (705). This transition is intended to trigger when it seems likely that the first attempt at asking Question1 failed due to environmental noise, for example background chatter or other noises. In the simplest case the function NoisyInterruptionRestart( ) could assume that all failures of the turn were due to noise and simply use the count of the number of times the turn question has been asked.

A better alternative would be to assume that all turn failures which passed through the state USER-GRABBING-FLOOR when BargeIn was enabled were due to noise that caused a premature interruption of the prompt. The following reflects this:

    boolean NoisyInterruptionRestart(Turn turn) {
        if (NOT turn.BargeIn) return false;
        if (turn.MatchHistory(USER-GRABBING-FLOOR)) {
            if (turn.RestartCount == 0) return true;
        }
        return false;
    }
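For readers who prefer an executable form, the same test could be written as the following sketch. The `Turn` class, its field names and the representation of the state history as a list of state names are assumptions made for illustration; only the decision logic is taken from the description.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """Hypothetical record of the turn properties used by the restart test."""
    barge_in: bool = True
    restart_count: int = 0
    history: list = field(default_factory=list)  # names of states visited

    def match_history(self, state):
        return state in self.history


def noisy_interruption_restart(turn):
    """True when a failed turn looks like a noise-triggered interruption."""
    if not turn.barge_in:
        return False
    # Restart at most once, and only if the prompt was actually interrupted.
    return turn.match_history("USER-GRABBING-FLOOR") and turn.restart_count == 0
```

A turn that failed without ever entering USER-GRABBING-FLOOR, or one that has already been restarted, is not restarted again.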

Those skilled in the art could conceive of other more complex definitions of this function which also take into account, for example, the prior history of the dialog as a whole. Another alternative could take into account the turn confidence of previous turns to make its decision. More accurate determinations as to whether the recognition failure was due to noise or some other user behavior could be conceived. This invention will benefit from such improvements, but has utility in the simpler form described here.

Setting Question1.BargeIn to false in step (705) has the effect of preventing any environmental noise from accidentally interrupting the repeat of the prompt for Question1. This also has the effect that any user speech which interrupts this prompt will also be apparently ignored by the state machine, although recall that the turn state machine can continue to listen and remember results when it is holding the floor. This floor holding may appear to be undesirable, but recall that noise is likely to prevent the caller even hearing the whole turn before it is interrupted. Without floor holding the user may never even hear the question that they are expected to answer. Floor holding is thus much preferable to the alternative of an unstable user interface in the presence of noise. Users who do speak over the prompt will find that the machine does not back off, and as a result they will probably back off instead (see FIG. 1D). As has already been stated, such circumstances are common in human-human conversation and the user will simply repeat the utterance at the end of the turn. The noisy environment will still affect the recognition accuracy of such an utterance, and may still result in an onNoReco event. The user has at least been guaranteed to have heard the whole turn and been given one attempt at answering the question.

In the case where NoisyInterruptionRestart( ) is false (702), the assumption is made that the first turn failed for reasons other than a noisy interruption. The event onTurnComplete from Question1 is also simply treated as a failure, in this case the failure of the user to present any speech at all. This causes a follow-on turn, Question2 (716), to be started. Before this, Question2.RestartCount is set to zero as per Question1; however, Question2.BargeIn can be set to the same value as the most recent value of Question1.BargeIn (703). In this way the user interface continues to assume that there is a noisy environment and therefore holds the floor for the next question. The designer is free to choose the policy for promulgating this throughout the subsequent dialog.

In the case where Question2 completes confidently (709) the question device similarly completes, throwing the onSuccess event (717). Similar comments to those above apply regarding the handling of an Inter-Word condition under these circumstances. Where it completes with onNoReco the same pattern of detecting noisy interruptions can be followed as per Question1 (712 and 713). This path cannot be followed in the case where Question2.BargeIn is set to false, however, because NoisyInterruptionRestart( ) then returns false, thus avoiding successive question restarts. All other failures can result in the question device returning a failure condition (711).
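The overall flow of the question device of FIG. 8 can be approximated by the sketch below. It is a simplification under several assumptions: `run_turn` is a hypothetical callback that runs one turn state machine and reports the completion event, whether the result was Inter-word, and whether USER-GRABBING-FLOOR was visited; the Inter-word case is handled identically for both questions; and the comparison of two uncertain results is omitted.

```python
def question_device(run_turn):
    """Run Question1, optionally Repeat and Question2; return the final status."""
    barge_in = True                                   # 701
    for name in ("Question1", "Question2"):
        restart_count = 0                             # 701 / 703
        while True:
            event, inter_word, grabbed = run_turn(name, barge_in)
            if event == "onConfidentReco":
                if inter_word:
                    # Close-scoring candidates: ask the caller to repeat (715, 719).
                    repeat_event, _, _ = run_turn("Repeat", True)
                    if repeat_event != "onConfidentReco":
                        return "onFailure"            # 710 -> 718
                return "onSuccess"                    # 717
            noisy = (event == "onNoReco" and barge_in
                     and grabbed and restart_count == 0)
            if not noisy:
                break                                 # 702, or fall through to 711
            barge_in = False                          # 705 / 713: hold the floor
            restart_count += 1
    return "onFailure"                                # 711


# Scripted example: noise interrupts Question1, the floor-holding repeat
# also fails, and Question2 (still holding the floor) succeeds.
script = iter([("onNoReco", False, True),
               ("onNoReco", False, False),
               ("onConfidentReco", False, False)])
calls = []


def scripted_turn(name, barge_in):
    calls.append((name, barge_in))
    return next(script)


result = question_device(scripted_turn)
```

In the scripted run, `calls` records that Question1 is asked twice (first with BargeIn true, then false) and that Question2 inherits the false BargeIn value.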

It should be noted that these patterns can be applied to existing speech systems, although the feature of listening to speech during the prompt when barge-in is false is rarely possible. This does not however stop the restart behavior from being implemented.

Noise Stabilization Internal to the Turn Machine

A further extension to the approach above would be to embed this noise stabilization approach into the turn machine itself. FIG. 9 shows how this is achieved. FIG. 9 is based upon the state machine of FIG. 2 or FIG. 7. For simplicity the states USER-GRABBING-FLOOR and SYSTEM-BACKED-OFF are omitted because they do not require any modification. Extending the state machine with the floor holding re-start pattern described above is straightforward. Firstly, the newly added RestartCount can be set to zero on entry to the state machine (833). Secondly, the transition from USER-HAS-FLOOR to the end state onNoReco (819) is modified to prevent the state machine completing unless NoisyInterruptionRestart( ) is false. Also the BargeIn flag must be true or the turn machine must be presenting the final move. For single-move turns this is always true and therefore not relevant. Multi-move turns will be discussed later. An extra transition (829) from USER-HAS-FLOOR to SYSTEM-HAS-FLOOR then matches these excluded conditions. This new transition is the equivalent of moving the external restart transitions in FIG. 8 (i.e. 704 and 712) internal to the turn state machine. The BargeIn flag for the turn can be set to false (830) and the RestartCount can be incremented on this transition (831). This performs the same function as those operations seen in the transitions of FIG. 8 (705 and 713). The HoldTimeout(s) are also modified depending on the BargeIn flag as described above.

This change means that, subject to the definition of NoisyInterruptionRestart( ), all turns are now capable of the noisy restart behavior. It is sometimes desirable that this behavior can be suppressed on demand by the designer. The turn ‘Repeat’ in FIG. 8 (715), for example, does not require the repeat behavior. An extra Boolean parameter ‘AllowNoisyInterruptionRestarts’ is added to the turn engine in order to achieve this. The definition of NoisyInterruptionRestart( ) thus becomes:

    boolean NoisyInterruptionRestart(Turn turn) {
        if (NOT turn.AllowNoisyInterruptionRestarts) return false;
        if (NOT turn.BargeIn) return false;
        if (turn.MatchHistory(USER-GRABBING-FLOOR)) {
            if (turn.RestartCount == 0) return true;
        }
        return false;
    }

With just these modifications, the revised turn state machine of FIG. 9 can be used to deliver the same behavior as that shown in FIG. 8. FIG. 10 shows a new question asking device which uses the turn engine of FIG. 9 instead of that of FIG. 2 or 7. Note how the new question answering device has no need now to be aware of the restart behavior of the turns.

There are two benefits to internalizing this behavior. The first is that the turn engines of FIGS. 2 and 7 already had the ability to instigate their own internal re-starts. This happens on the transition from BOTH-BACKED-OFF to SYSTEM-HAS-FLOOR (115), which caters for the condition where both the machine and the user have backed off and the machine decides to start the turn again. In the turn engine of FIG. 2 or 7 such re-starts could potentially occur more than once if the function RestartOnBackoff( ) did not keep count of the number of restarts attempted. This transition has not been altered in FIG. 9, but the newly added RestartCount parameter is now incremented when this transition happens (831). This counter can now be shared between RestartOnBackoff( ) and NoisyInterruptionRestart( ), ensuring for example that a turn restart only occurs once in the execution of the whole turn engine regardless of the cause of the turn restart.
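The effect of sharing a single counter between the two restart policies can be illustrated with the following sketch. The definition of `restart_on_backoff` shown here (allow a restart only while the count is zero) is an assumption, since the description leaves that function open; the `Turn` class and its field names are likewise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """Hypothetical turn record carrying the shared restart counter."""
    allow_noisy_interruption_restarts: bool = True
    barge_in: bool = True
    restart_count: int = 0
    history: list = field(default_factory=list)  # names of states visited


def restart_on_backoff(turn):
    # Assumed policy: permit a back-off restart only if none has happened yet.
    return turn.restart_count == 0


def noisy_interruption_restart(turn):
    if not turn.allow_noisy_interruption_restarts:
        return False
    if not turn.barge_in:
        return False
    return "USER-GRABBING-FLOOR" in turn.history and turn.restart_count == 0


def record_restart(turn):
    # Performed on either restart transition (831).
    turn.restart_count += 1


turn = Turn(history=["SYSTEM-HAS-FLOOR", "USER-GRABBING-FLOOR"])
before = (restart_on_backoff(turn), noisy_interruption_restart(turn))
record_restart(turn)
after = (restart_on_backoff(turn), noisy_interruption_restart(turn))
```

Once either kind of restart has been recorded, both policy functions decline a further restart, which is the once-per-turn guarantee described above.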

The second benefit of internalizing the noisy restart behavior concerns multi-move turns. In the example of FIG. 8 all of the turns comprised a single move. If, however, the turns were made up of multiple moves then there may have been a potential problem. Recall that the state machine is always listening. If speech onset is detected during prompt playback then the floor is given over to the speaker once the current move is completed, even when the HoldTimeout(s) are infinite. Similarly, speech onset during the pauses between moves causes the floor to be given over to the user. Thus without further modification to the turn engine, setting the BargeIn flag of the engine to false would still allow users to halt the progress of the turn at the first move boundary following speech onset. Recall that the pauses between moves are not generally points where the turn is completely relinquishing the floor (i.e. elective turn boundaries) but they are points where strong turn-taking cues are often present. Allowing onsets to lead to a turn-grab in such places is sensible behavior in a quiet environment, but if there is systematic environmental noise then it is very likely that the onSpeechDetected event may occur spuriously and cause the floor to be falsely yielded at such points.

For this reason the turn engine can be modified so that when the BargeIn flag is set to false it does not yield the floor at the move boundaries, in addition to floor-holding during prompt playback itself. The modification does not actually ignore onsets; it merely stops onsets from preventing subsequent moves from starting. The recognizer is still active during this period. In order to achieve this, a new state has been added to the turn engine. This state, SYSTEM-GRABBING-FLOOR (7), represents the condition where the machine is grabbing the floor from the user. The operation of these modifications is described below by way of an example sequence of events in a three-move turn.

Referring to FIG. 9, imagine that we are in the User-Has-Floor state and a transition has just been triggered by an onNoReco from a noisy restart (829) at some point during the presentation of the first move of a three-move turn. As described above, the RestartCount is incremented (831) and the BargeIn flag is set to false (830). As in the case of restarts caused by back-offs (115), the move counter n is then reset to zero, and the prompt associated with this move is started along with the PreHoldTimer. The turn starts executing from the beginning again; that is to say, it restarts.

In order to explore the initial evolution of our example we need to return to FIG. 2 because, for clarity, the relevant states are omitted in FIG. 9. Let us now imagine that background noise immediately causes onSpeechDetected and a transition occurs (106) to the state USER-GRABBING-FLOOR. The BargeIn flag has set the HoldTimeout to infinity for all the move zones, so transition 107 cannot cause the SYSTEM-BACKED-OFF state to be entered and the prompt will not be cut. Assume that the prompt for the first move completes, throwing onPromptComplete. The recognizer is still listening to noise, so the state machine moves to the state USER-HAS-FLOOR via transition 810. In the ordinary operation of FIG. 2 the subsequent moves would be suppressed at this point because the user has been given the floor. However, given that the BargeIn flag is false, we assume that the incoming ‘speech’ may actually be noise. For this reason the YieldTimer associated with the first move is started (839). Recall that this is the timeout between two moves and will thus be fairly short. Let us assume that this timeout completes whilst the recognizer is still listening to the noise, and the onYieldTimeout event is triggered. The LastMove has not been reached and the BargeIn flag is false, so a transition to SYSTEM-GRABBING-FLOOR occurs (822). This increments the move counter and starts the prompt for the next move (835), grabbing the floor back from the user, which in this case may be merely background noise. The turn engine has thus decided to start the next move in spite of the fact that the user may still be speaking.

Note that, unlike the case of the SYSTEM-HAS-FLOOR state, a PreHoldTimer is not started with the new move prompt. This is because in the SYSTEM-GRABBING-FLOOR state a recognition match is already known to be evolving. It would not be appropriate to kill it at the PreHold boundary and restart it, because confident recognition could be occurring. Instead, the recognizer can be stopped and restarted (837) on transition to the SYSTEM-HAS-FLOOR state in response to an onNoReco or onYieldReco event (824). That is to say, the recognizer for the current move is started when the user appears to have backed off or the noise has ceased.

As an aside, imagine the case where the user had in fact been uttering an in-vocabulary utterance and continued to speak in spite of the machine grabbing the floor. Let us further assume that the recognizer returned a confident result (onConfidentReco) just after the system started to grab the floor back. In this case, the turn ends successfully (825) and the current prompt is stopped (836). Thus, in spite of the BargeIn flag being set to false, the turn engine was still listening, and confident results do in fact force the turn to complete. This strategy is similar to the ‘YieldWhenConfident’ one discussed previously rather than the ‘Always Hold’.

We return to the case where the system is grabbing the floor in the presence of noise. Let us further imagine the noise doesn't end and the prompt for the second move also completes. As before, if the BargeIn flag is false (as it is likely to be given we are in the SYSTEM-GRABBING-FLOOR state) then the yield timer for the next move is started (838) and the user has the floor again (826).
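The event handling described for the SYSTEM-GRABBING-FLOOR state can be summarized, purely as an illustrative sketch, in a small lookup table. The function name grabbing_floor_step, the state label "END", and the action labels are hypothetical; the sketch assumes the noisy-restart case of the walkthrough, in which the yield timer is started when the prompt completes.

```python
from typing import List, Tuple


def grabbing_floor_step(event: str) -> Tuple[str, List[str]]:
    """Return (next_state, actions) for one event in SYSTEM-GRABBING-FLOOR."""
    table = {
        # Confident result: the turn ends (825) and the prompt is stopped (836).
        "onConfidentReco": ("END", ["stop_prompt"]),
        # User backed off or noise ceased: go to SYSTEM-HAS-FLOOR (824)
        # and restart the recognizer for the current move (837).
        "onNoReco": ("SYSTEM-HAS-FLOOR", ["restart_recognizer"]),
        "onYieldReco": ("SYSTEM-HAS-FLOOR", ["restart_recognizer"]),
        # Prompt finished while input continues: the user has the floor
        # again (826) and the yield timer is started (838).
        "onPromptComplete": ("USER-HAS-FLOOR", ["start_yield_timer"]),
    }
    # Any event not listed leaves the state unchanged (assumption).
    return table.get(event, ("SYSTEM-GRABBING-FLOOR", []))
```

A table of this kind keeps the transition numbers of FIG. 9 next to the events that trigger them, which may make the figure easier to check against an implementation.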

Now, the recognizer throws an onNoReco event during the pause between the second and third move prompts. The BargeIn flag is false and the last move has not started yet, so the turn transitions into the BOTH-YIELDED state (827). This is another good point, if necessary, to stop the previous recognizer and ensure the recognizer matches the current move (841). In our example the visit is short-lived, however, as continuing noise immediately triggers the onSpeechDetected event and the engine returns to the USER-HAS-FLOOR state (102).

As another aside, it should be noted that with BargeIn set to false in the turn engine of FIG. 9, in the case of the onYieldReco event, the engine does not transition to the BOTH-BACKED-OFF state via transition (820); instead it immediately transitions to the BOTH-YIELDED state via transition (827). The transition into the BOTH-BACKED-OFF state (820) cannot occur. This avoids the need for the BOTH-BACKED-OFF state to deal with onYieldTimeout events. Thus users are not given the benefit of the RestartTimeout to restart their utterances. This is in line with the policy of floor holding when the BargeIn flag is false.

Returning to our example, the user (or noise) has the floor following the onSpeechDetected event. The YieldTimer for the second move then completes (822) and the prompt for the final move is started (835). Let us assume that this prompt completes before any recognition status is returned. In this case the USER-HAS-FLOOR state is re-entered (826) and the final yield timeout starts (838). If this final yield timer completes, it is now ignored (823). This is because we have now completed the prompt for the final move and are at an elective turn boundary; i.e. the outcome of the next recognition event will determine the outcome of the whole turn. Confident recognition in this final phase will result in the turn engine completing. An onNoReco event will also cause the turn to complete, assuming that the function NoisyInterruptionRestart( ) does not permit more than one restart in a turn. The RestartCount is now non-zero so the turn will not be restarted again via transition (829).
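The restart bookkeeping used in this example can be sketched as follows, under stated assumptions. The names TurnVars, apply_noisy_restart and on_no_reco, the max_restarts parameter, and the return labels "RESTART" and "COMPLETE" are hypothetical. The sketch assumes NoisyInterruptionRestart( ) depends only on the RestartCount; any further guards on transition (829) are not modeled.

```python
from dataclasses import dataclass


@dataclass
class TurnVars:
    """Hypothetical holder for the counters and flags named in the text."""
    n: int = 0               # move counter
    restart_count: int = 0   # RestartCount
    barge_in: bool = True    # BargeIn flag


def noisy_interruption_restart(turn: TurnVars, max_restarts: int = 1) -> bool:
    """Assumed policy: allow a noisy restart only while under the limit."""
    return turn.restart_count < max_restarts


def apply_noisy_restart(turn: TurnVars) -> None:
    """Side effects of a noisy restart as described for transition (829)."""
    turn.restart_count += 1   # (831)
    turn.barge_in = False     # (830)
    turn.n = 0                # the turn restarts from its first move


def on_no_reco(turn: TurnVars) -> str:
    """Decide whether an onNoReco restarts the turn or completes it."""
    if noisy_interruption_restart(turn):
        apply_noisy_restart(turn)
        return "RESTART"
    return "COMPLETE"
```

Under this assumed policy the first onNoReco restarts the turn with BargeIn false, and a second onNoReco in the same turn completes it, because the RestartCount is by then non-zero.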

Subtle alterations to the emergent restart behavior can be envisaged by the re-definition of the functions NoisyInterruptionRestart( ), IsLateInterruption( ) and RestartOnBackoff( ). More than one restart could be permitted, for example, or restarts in response to back-offs could be counted separately from those caused by apparent noise. The definition of IsLateInterruption( ) could be based on the location of speech onset during the turn rather than the moves. This may be more appropriate in turns which have a large number of short moves.
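To make the two alternatives for IsLateInterruption( ) concrete, a minimal sketch is given below. Both function names, the parameters late_from_move and late_fraction, and their default values are assumptions chosen for illustration only; the disclosure does not fix any particular thresholds.

```python
def is_late_interruption_by_move(move_index: int, late_from_move: int = 1) -> bool:
    """Move-based variant: late if onset falls in or after a given move.

    move_index is the zero-based move in which speech onset occurred
    (assumption); late_from_move is a hypothetical threshold.
    """
    return move_index >= late_from_move


def is_late_interruption_by_time(onset_s: float, turn_duration_s: float,
                                 late_fraction: float = 0.5) -> bool:
    """Time-based variant: late if onset falls past a fraction of the turn.

    onset_s is the onset time measured from the start of the turn and
    late_fraction is a hypothetical threshold in the range 0 to 1.
    """
    return onset_s >= late_fraction * turn_duration_s
```

In a turn made of many short moves, the time-based variant gives the same answer for onsets that are close together in time even when they fall in different moves, which is the motivation suggested above.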

One additional feature of FIG. 9 is that there are now two START states, depending on the state of the speech recognizer on entry into the state machine. Recall that the turn state machine is designed such that the exit state can be connected to the input state to achieve continuous recognition even across turn boundaries. By adding the SYSTEM-GRABBING-FLOOR state, the turn machine can now be started correctly even under the condition that the recognizer has already detected user speech. Imagine the case where another state machine external to the turn machine has detected the condition that a user has begun to speak. Imagine further that this state machine decides that it wants to interrupt the user. This may be because the user has been speaking for too long, or the recognizer is listening to continuous noise. The external state machine can initiate the turn engine and enter it via the speech-detected start state (828). The first move prompt can be started and the machine can enter the SYSTEM-GRABBING-FLOOR state and interrupt the user. The turn machine will then continue to run in just the same manner as if the turn machine itself had initiated the interruption.
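The choice between the two START states might be sketched as follows. The function name start_turn and the action labels are hypothetical; the sketch assumes that the ordinary start enters SYSTEM-HAS-FLOOR with the first prompt and the PreHoldTimer, and that the speech-detected start (828) enters SYSTEM-GRABBING-FLOOR with the first prompt only.

```python
from typing import List, Tuple


def start_turn(speech_already_detected: bool) -> Tuple[str, List[str]]:
    """Pick the initial state and entry actions for a new turn."""
    if speech_already_detected:
        # Speech-detected start state (828): the recognizer is already
        # evolving a match, so no PreHoldTimer is started.
        return "SYSTEM-GRABBING-FLOOR", ["start_prompt"]
    # Ordinary start: prompt and PreHoldTimer are started together.
    return "SYSTEM-HAS-FLOOR", ["start_prompt", "start_pre_hold_timer"]
```

An external state machine that wishes to interrupt a user who is already speaking would call such an entry point with speech_already_detected set to true.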

In another subtly different embodiment, restarts which are caused by back-off (115) could also set the BargeIn flag to false. This may become necessary in environments with intermittent noise which may be mistaken for backed-off speech by the turn engine.

It is understood that multiple embodiments can take many forms and designs. Accordingly, several variations of the present design may be made without departing from the scope of this disclosure. The capabilities outlined herein allow for the possibility of a variety of networking models. This disclosure should not be read as preferring any particular networking model, but is instead directed to the underlying concepts on which these networking models can be built.

This disclosure comprises multiple embodiments. In a first embodiment, a method for managing interactive dialog between a machine and a user comprises: verbalizing at least one desired sequence of one or more spoken phrases; enabling a user to hear the at least one desired sequence of one or more spoken phrases; receiving audio input from the user or an environment of the user; determining a timing position of a possible speech onset from the audio input; and managing an interaction between the at least one desired sequence of spoken phrases and the audio input; in response to the timing position of the possible speech onset from the audio input. The first embodiment, further comprising managing the interaction in response to a timing position of a possible speech onset within a plurality of time zones, wherein the at least one desired sequence of one or more spoken phrases comprises the plurality of time zones. The first embodiment, wherein the plurality of time zones are dependent upon a continuous model of onset likelihood. The first embodiment, further comprising adjusting the at least one desired sequence of one or more spoken phrases in response to the timing position of the possible speech onset from the audio input.

The first embodiment, further comprising: stopping the at least one desired sequence of one or more spoken phrases; restarting the at least one desired sequence of one or more spoken phrases; or continuing the at least one desired sequence of one or more spoken phrases. The first embodiment, further comprising: adjusting the timing corresponding to stopping the at least one desired sequence of one or more spoken phrases; adjusting the timing corresponding to restarting the at least one desired sequence of one or more spoken phrases; or adjusting the timing corresponding to continuing the at least one desired sequence of one or more spoken phrases.

The first embodiment, further comprising: continuing the at least one desired sequence of one or more spoken phrases for a period of time in response to an interruption of the audio input; and receiving audio input during the period of time. The first embodiment, wherein a configuration of a process to produce a recognition result from the audio input is dependent upon the timing position of the possible speech onset. The first embodiment, wherein a possible speech onset by the audio input during a beginning portion of one time zone is considered to be in response to a previous time zone. The first embodiment, wherein audio input further comprises user input that corresponds to dual tone multi frequency (“DTMF”).

In a second embodiment, a method for interactive machine-to-person dialog comprising: verbalizing at least one desired sequence of one or more spoken phrases; enabling a user to hear the at least one desired sequence of one or more spoken phrases; receiving audio input from the user or an environment of the user; detecting a possible speech onset from the audio input; ceasing the at least one desired sequence of one or more spoken phrases in response to a detection of the possible speech onset; and managing an interaction between the at least one desired sequence of one or more spoken phrases and the audio input, wherein the interaction is dependent upon the timing of at least one recognition result relative to a cessation of the at least one desired sequence. The second embodiment, further comprising restarting or not restarting the at least one desired sequence of one or more spoken phrases in response to the timing position of receipt of the recognition result. The second embodiment, wherein restarting the at least one desired sequence of one or more spoken phrases further comprises altering the wording or intonation of the at least one desired sequence of one or more spoken phrases.

The second embodiment, wherein restarting the at least one desired sequence of spoken phrases further comprises restarting the at least one desired sequence of spoken phrases from a point that is not a beginning point of the at least one desired sequence of spoken phrases. The second embodiment, wherein restarting the at least one desired sequence of spoken phrases further comprises restarting the at least one desired sequence of spoken phrases from a point that is substantially near to where the desired sequence of one or more spoken phrases ceased. The second embodiment, further comprising adjusting an amplitude of the at least one desired sequence of one or more spoken phrases in response to a possible speech onset, wherein ceasing the at least one desired sequence of one or more phrases is achieved by a modulation of amplitude over time.

In a third embodiment, a method for interactive machine-to-person dialog comprising: verbalizing at least one desired sequence of one or more spoken phrases; enabling a user to hear the at least one desired sequence of one or more spoken phrases; receiving audio input from the user or an environment of the user; detecting a possible speech onset from the audio input; ceasing the at least one desired sequence of one or more spoken phrases in response to a detection of possible speech onset at a point where onset occurred while the desired sequence was being verbalized; and managing a continuous interaction between the at least one desired sequence of one or more spoken phrases and the audio input, wherein the interaction is dependent upon at least one recognition result and whether the desired sequence of one or more spoken phrases was ceased or not ceased.

The third embodiment, wherein in response to a low confidence recognition result, a subsequent desired sequence of one or more spoken phrases does not cease after a detection of a subsequent possible speech onset. The third embodiment, wherein the subsequent desired sequence of one or more spoken phrases is substantially the same as the desired sequence of one or more spoken phrases. The third embodiment, further comprising, in response to a subsequent low confidence recognition result, receiving audio input while continuing to verbalize the at least one desired sequence of one or more spoken phrases, and in response to a subsequent high confidence recognition result, the subsequent desired sequence of one or more spoken phrases ceases after detection of possible speech onset.

Having thus described specific embodiments, it is noted that the embodiments disclosed are illustrative rather than limiting in nature and that a wide range of variations, modifications, changes, and substitutions are contemplated in the foregoing disclosure and, in some instances, some features may be employed without a corresponding use of the other features. Many such variations and modifications may be considered desirable by those skilled in the art based upon a review of the foregoing description of embodiments. Accordingly, it is appropriate that the appended claims be construed broadly and in a manner consistent with the scope of these embodiments.

1. A method for managing interactive dialog between a machine and a user comprising: verbalizing at least one desired sequence of one or more spoken phrases; enabling a user to hear the at least one desired sequence of one or more spoken phrases; receiving audio input from the user or an environment of the user; determining a timing position of a possible speech onset from the audio input; and managing an interaction between the at least one desired sequence of spoken phrases and the audio input; in response to the timing position of the possible speech onset from the audio input.
2. The method of claim 1 further comprising managing the interaction in response to a timing position of a possible speech onset within a plurality of time zones, wherein the at least one desired sequence of one or more spoken phrases comprises the plurality of time zones.
3. The method of claim 2, wherein the plurality of time zones are dependent upon a continuous model of onset likelihood.
4. The method of claim 1, further comprising adjusting the at least one desired sequence of one or more spoken phrases in response to the timing position of the possible speech onset from the audio input.
5. The method of claim 4, further comprising: stopping the at least one desired sequence of one or more spoken phrases; restarting the at least one desired sequence of one or more spoken phrases; or continuing the at least one desired sequence of one or more spoken phrases.
6. The method of claim 5, further comprising: adjusting the timing corresponding to stopping the at least one desired sequence of one or more spoken phrases; adjusting the timing corresponding to restarting the at least one desired sequence of one or more spoken phrases; or adjusting the timing corresponding to continuing the at least one desired sequence of one or more spoken phrases.
7. The method of claim 5, further comprising: continuing the at least one desired sequence of one or more spoken phrases for a period of time in response to an interruption of the audio input; and receiving audio input during the period of time.
8. The method of claim 1, wherein a configuration of a process to produce a recognition result from the audio input is dependent upon the timing position of the possible speech onset.
9. The method of claim 2, wherein a possible speech onset by the audio input during a beginning portion of one time zone is considered to be in response to a previous time zone.
10. The method of claim 1, wherein audio input further comprises user input that corresponds to dual tone multi frequency (“DTMF”).
11. A method for interactive machine-to-person dialog comprising: verbalizing at least one desired sequence of one or more spoken phrases; enabling a user to hear the at least one desired sequence of one or more spoken phrases; receiving audio input from the user or an environment of the user; detecting a possible speech onset from the audio input; ceasing the at least one desired sequence of one or more spoken phrases in response to a detection of the possible speech onset; and managing an interaction between the at least one desired sequence of one or more spoken phrases and the audio input, wherein the interaction is dependent upon the timing of at least one recognition result relative to a cessation of the at least one desired sequence.
12. The method of claim 11, further comprising restarting or not restarting the at least one desired sequence of one or more spoken phrases in response to the timing position of receipt of the recognition result.
13. The method of claim 12, wherein restarting the at least one desired sequence of one or more spoken phrases further comprises altering the wording or intonation of the at least one desired sequence of one or more spoken phrases.
14. The method of claim 12, wherein restarting the at least one desired sequence of spoken phrases further comprises restarting the at least one desired sequence of spoken phrases from a point that is not a beginning point of the at least one desired sequence of spoken phrases.
15. The method of claim 12, wherein restarting the at least one desired sequence of spoken phrases further comprises restarting the at least one desired sequence of spoken phrases from a point that is substantially near to where the desired sequence of one or more spoken phrases ceased.
16. The method of claim 11, further comprising adjusting an amplitude of the at least one desired sequence of one or more spoken phrases in response to a possible speech onset, wherein ceasing the at least one desired sequence of one or more phrases is achieved by a modulation of amplitude over time.
17. A method for interactive machine-to-person dialog comprising: verbalizing at least one desired sequence of one or more spoken phrases; enabling a user to hear the at least one desired sequence of one or more spoken phrases; receiving audio input from the user or an environment of the user; detecting a possible speech onset from the audio input; ceasing the at least one desired sequence of one or more spoken phrases in response to a detection of possible speech onset at a point where onset occurred while the desired sequence was being verbalized; and managing a continuous interaction between the at least one desired sequence of one or more spoken phrases and the audio input, wherein the interaction is dependent upon at least one recognition result and whether the desired sequence of one or more spoken phrases was ceased or not ceased.
18. The method of claim 17, wherein in response to a low confidence recognition result, a subsequent desired sequence of one or more spoken phrases does not cease after a detection of a subsequent possible speech onset.
19. The method of claim 18, wherein the subsequent desired sequence of one or more spoken phrases is substantially the same as the desired sequence of one or more spoken phrases.
20. The method of claim 18, further comprising, in response to a subsequent low confidence recognition result, receiving audio input while continuing to verbalize the at least one desired sequence of one or more spoken phrases, and in response to a subsequent high confidence recognition result, the subsequent desired sequence of one or more spoken phrases ceases after detection of possible speech onset.